Abstract — This article presents an efficient method to calculate in parallel the values of the Smarandache function S(i), i = 1, 2, ..., n. The value S(i) can be found sequentially with a
complexity of . The computation has an important constraint, 
which is to have consecutive values computed by the same processor. This makes dynamic scheduling methods inapplicable.
The proposed solution is based on a Balanced Workload Block 
Scheduling method. Experiments show that the method is efficient 
and generates a good load balance. 

I. Introduction 

The Smarandache function [7] is a relatively new function in Number Theory, and yet a number of algorithms for its computation already exist. The intention of this article is to develop an efficient algorithm to compute in parallel all the values {S(i), i = 1, 2, ..., n}. This problem often occurs in real computations, for example when checking conjectures on S.

To begin, the Smarandache function [7] $S : \mathbb{N}^* \to \mathbb{N}$ is defined as

$$ S(n) = \min\{k \in \mathbb{N} : n \mid k!\}. \qquad (1) $$

An important property of this function is given by the following:

$$ (\forall a, b \in \mathbb{N}^*)\quad (a, b) = 1 \implies S(a \cdot b) = \max\{S(a), S(b)\}. \qquad (2) $$

Expanding on this, for distinct primes $p_1, \ldots, p_s$ it is clear that

$$ S(p_1^{k_1} \cdots p_s^{k_s}) = \max\{S(p_1^{k_1}), \ldots, S(p_s^{k_s})\}. \qquad (3) $$
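As a small illustration of Equations (1) and (2), consider $n = 12 = 4 \cdot 3$. The direct check against the definition and the reduction through the coprime factors 4 and 3 give the same value:

```latex
\begin{aligned}
&12 \nmid 3! = 6,\quad 12 \mid 4! = 24 &&\Longrightarrow\; S(12) = 4,\\
&S(4) = 4,\quad S(3) = 3,\quad (4, 3) = 1 &&\Longrightarrow\; S(12) = \max\{S(4), S(3)\} = 4.
\end{aligned}
```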

Therefore, the value of the function at n can be computed from the prime decomposition of n, which reduces the problem to prime powers. For a prime power there is a simple formula for $S(p^k)$:

$$ k = \sum_{i=1}^{l} d_i \cdot \frac{p^i - 1}{p - 1} \;\Longrightarrow\; S(p^k) = \sum_{i=1}^{l} d_i \cdot p^i. \qquad (4) $$
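A short worked instance of Equation (4), with $p = 3$ and $k = 5$, shows how the digits move from one base to the other. The final line is only a check of the result with Legendre's formula for the exponent of $p$ in $m!$, which is a standard identity and not part of the method described here:

```latex
\begin{aligned}
&p = 3:\quad \tfrac{3^1 - 1}{2} = 1,\quad \tfrac{3^2 - 1}{2} = 4,\quad \tfrac{3^3 - 1}{2} = 13 > 5,\\
&k = 5 = 1 \cdot 4 + 1 \cdot 1 \;\Longrightarrow\; S(3^5) = 1 \cdot 3^2 + 1 \cdot 3 = 12,\\
&\text{check: } \lfloor 12/3 \rfloor + \lfloor 12/9 \rfloor = 5 \ge 5, \qquad \lfloor 11/3 \rfloor + \lfloor 11/9 \rfloor = 4 < 5.
\end{aligned}
```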

Several studies have examined the connection between the function S and prime numbers. It has been proven by Ford [2] that the values of S are almost always prime, satisfying

$$ \lim_{n \to \infty} \frac{|\{i \le n : S(i) \text{ prime}\}|}{n} = 1. \qquad (5) $$

Several sequential methods to compute the Smarandache function have emerged since its initial definition in 1980. Ibstedt [3], [4] developed an algorithm based on Equations (3) and (4) without studying its complexity. Later, Power et al. [6] analyzed this algorithm and determined its complexity. The UBASIC implementation used by Ibstedt has proved to be efficient and useful, especially for large values of n. Subsequently, Tabirca [9] studied a simple algorithm based on Equation (1) that considers the sequence $x_k = k! \bmod n$. This turns out to be a rather inefficient computation that is impractical for large values of n; it was shown to have a complexity of $O(S(n))$. However, studies [10], [11] and [5] find that the average
complexity of this algorithm is 0( p^)- 
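The sequence approach of [9] can be illustrated with a minimal Java sketch (the class and method names are hypothetical, and this is not the code of [9]). It updates $x_k = k \cdot x_{k-1} \bmod n$, which equals $k! \bmod n$, and stops at the first $k$ with $x_k = 0$, so the loop runs exactly S(n) times. It assumes that $n \cdot S(n)$ fits in a `long`.

```java
// Minimal sketch of the k! mod n approach: S(n) is the first k with k! = 0 (mod n).
class FactorialModSketch {
    // Returns the least k >= 0 such that n divides k!, i.e. S(n) as in Equation (1).
    static long sByFactorial(long n) {
        long k = 0;
        long x = 1 % n;          // x_0 = 0! mod n; this is 0 only when n = 1
        while (x != 0) {
            k++;
            x = (x * k) % n;     // x_k = k * x_{k-1} mod n = k! mod n (assumes n * k fits in a long)
        }
        return k;                // the loop ran S(n) times
    }

    public static void main(String[] args) {
        for (long n = 1; n <= 12; n++) {
            System.out.println("S(" + n + ") = " + sByFactorial(n));
        }
    }
}
```

Returning 0 for n = 1 matches the convention of the procedure in Fig. 2, which also returns 0 for n = 1 (that is, $0 \in \mathbb{N}$ and $0! = 1$).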



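static long Value(long p, long k) {
    long value = 0;
    long[] d1 = new long[1000];   // d1[i] = 1 + p + ... + p^i = (p^(i+1) - 1)/(p - 1)
    long[] d2 = new long[1000];   // d2[i] = p^(i+1)
    d1[0] = 1; d2[0] = p;
    int l;
    for (l = 0; d1[l] <= k; l++) {        // extend the generalized base [p] past k
        d1[l + 1] = 1 + p * d1[l];
        d2[l + 1] = p * d2[l];
    }
    for (int j = l - 1; j >= 0; j--) {    // digits of k in base [p], most significant first
        long d = k / d1[j];
        k = k % d1[j];
        value += d * d2[j];               // same digit with weight p^(j+1), Equation (4)
    }
    return value;
}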

Fig. 1. The procedure for S(p^k).
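A procedure such as the one in Fig. 1 can be tested against the definition. The sketch below (hypothetical class and method names) recomputes $S(p^k)$ as the least $m$ with $v_p(m!) \ge k$, using Legendre's formula $v_p(m!) = \sum_{i \ge 1} \lfloor m/p^i \rfloor$. Legendre's formula is a standard identity brought in here only as an independent check; it is not part of the method described in the article.

```java
// Cross-check of Equation (4) against Definition (1) for prime powers.
class PrimePowerCheck {
    // Equation (4), same technique as Fig. 1: digits of k in the base
    // a_i = (p^i - 1)/(p - 1), read back in the base p^i.
    static long valueByBase(long p, long k) {
        long[] a = new long[64];   // a[i] = 1 + p + ... + p^i (enough for the values tested here)
        long[] b = new long[64];   // b[i] = p^(i+1)
        a[0] = 1;
        b[0] = p;
        int l = 0;
        while (a[l] <= k) {
            a[l + 1] = 1 + p * a[l];
            b[l + 1] = p * b[l];
            l++;
        }
        long value = 0;
        for (int j = l - 1; j >= 0; j--) {
            value += (k / a[j]) * b[j];
            k %= a[j];
        }
        return value;
    }

    // Legendre's formula: the exponent of the prime p in m!.
    static long legendre(long m, long p) {
        long e = 0;
        for (long q = m / p; q > 0; q /= p) e += q;
        return e;
    }

    // Definition (1) for n = p^k: least m with v_p(m!) >= k; m is always a multiple of p.
    static long valueByDefinition(long p, long k) {
        long m = 0;
        while (legendre(m, p) < k) m += p;
        return m;
    }

    public static void main(String[] args) {
        long[] primes = {2, 3, 5, 7, 11, 13};
        for (long p : primes)
            for (long k = 1; k <= 200; k++)
                if (valueByBase(p, k) != valueByDefinition(p, k))
                    throw new AssertionError("mismatch at p=" + p + ", k=" + k);
        System.out.println("Equation (4) agrees with Definition (1) on all tested cases");
    }
}
```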

A. An Efficient Sequential Algorithm 

The Smarandache function can be computed sequentially by an algorithm based on Equations (3) and (4). Clearly, if an efficient method to calculate the function on prime powers exists, Equation (3) extends it easily to all other integers. This observation prompted Ibstedt to develop his algorithm for the computation of the Smarandache function, which is briefly examined in this section.

In Equation (4), $(d_l, d_{l-1}, \ldots, d_1)$ is the representation of $k$ in the generalized base $[p]$, whose digits carry the weights $(p^i - 1)/(p - 1)$, that is $1, p + 1, \ldots, (p^l - 1)/(p - 1)$, so that the same digits $(d_l, d_{l-1}, \ldots, d_1)$ represent $S(p^k)$ in the generalized base $p, p^2, \ldots, p^l$. This gives a relationship between $p^k$ and $S(p^k)$. With this it is possible to write a method to calculate the Smarandache function on a prime power, as in Fig. 1. Once this is in place, it is then possible to calculate the function on any integer.

public static long S(long n) {
    long valueMax = 0;
    if (n == 1) return 0;
    // Trial division: each divisor d found here is prime, because all
    // smaller prime factors have already been divided out of n.
    for (long d = 2; d <= n; d++) {
        if (n % d == 0) {
            long k = 0;
            for (; n % d == 0; k++, n /= d);          // k = exponent of d in n
            long value = Value(d, k);                  // S(d^k) by Fig. 1
            if (valueMax < value) valueMax = value;    // maximum of Equation (3)
        }
    }
    return valueMax;
}

Fig. 2. The procedure for S(n).

Note that a prime decomposition algorithm is needed for the computation of the Smarandache function, which for these purposes can be a simple trial division algorithm. Once a prime decomposition $n = p_1^{k_1} \cdots p_s^{k_s}$ is available, Equation (3) gives $S(n) = \max\{S(p_1^{k_1}), \ldots, S(p_s^{k_s})\}$. This is described in Figure 2.
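The procedure of Fig. 2 tries every d up to the remaining cofactor. A common variant, sketched below as an illustrative alternative rather than the authors' code (names are hypothetical), stops trial division at $\sqrt{n}$: any cofactor left above 1 is then a prime $p$ with exponent 1, and $S(p) = p$. The sketch checks itself against Definition (1) for small n.

```java
// Variant of Fig. 2 that stops trial division at sqrt(n), checked against Equation (1).
class SmarandacheSqrtSketch {
    // S(p^k) by Equation (4), same technique as Fig. 1.
    static long value(long p, long k) {
        long[] a = new long[64];   // a[i] = 1 + p + ... + p^i
        long[] b = new long[64];   // b[i] = p^(i+1)
        a[0] = 1;
        b[0] = p;
        int l = 0;
        while (a[l] <= k) {
            a[l + 1] = 1 + p * a[l];
            b[l + 1] = p * b[l];
            l++;
        }
        long v = 0;
        for (int j = l - 1; j >= 0; j--) {
            v += (k / a[j]) * b[j];
            k %= a[j];
        }
        return v;
    }

    // S(n) by Equation (3): the maximum of S(p^k) over the prime powers of n.
    static long s(long n) {
        long best = 0;                         // S(1) = 0, as in Fig. 2
        for (long d = 2; d * d <= n; d++) {
            if (n % d == 0) {
                long k = 0;
                while (n % d == 0) { n /= d; k++; }
                best = Math.max(best, value(d, k));
            }
        }
        if (n > 1) best = Math.max(best, n);   // leftover prime p with exponent 1: S(p) = p
        return best;
    }

    // Equation (1) directly: least k with n | k!.
    static long sByFactorial(long n) {
        long k = 0, x = 1 % n;
        while (x != 0) { k++; x = (x * k) % n; }
        return k;
    }

    public static void main(String[] args) {
        for (long n = 1; n <= 5000; n++)
            if (s(n) != sByFactorial(n)) throw new AssertionError("mismatch at n = " + n);
        System.out.println("sqrt-bounded trial division agrees with Equation (1) for n <= 5000");
    }
}
```

The design choice only affects the cost of the prime decomposition; the values of S(n) are the same as those of Fig. 2.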

II. Computing in Parallel 

Although the calculation of a single value S(n) can itself be performed in parallel (see Power et al. [6]), that is not the purpose of this article. Instead, we want the values {S(i), i = 1, 2, ..., n} to be computed in parallel. Let us suppose that this is done by the doall loop

do par i = 1, n
    calculate S(i)
end do
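The doall loop above can be sketched in Java with one thread per processor, where thread j receives a block of consecutive indices lo[j]..hi[j]. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, the block bounds are supplied by the caller (for example by one of the scheduling methods discussed below), and, for brevity, the per-index work uses the definition-based k! mod n method as a stand-in for the procedure of Figure 2.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of "do par i = 1, n: calculate S(i)" with one thread per processor.
// Thread j computes the consecutive indices lo[j]..hi[j], as required in the text.
class DoallSketch {
    // Stand-in for the S procedure of Fig. 2: Equation (1) via k! mod n.
    static long s(long n) {
        long k = 0, x = 1 % n;
        while (x != 0) { k++; x = (x * k) % n; }
        return k;
    }

    static long[] doall(int n, int[] lo, int[] hi) throws InterruptedException {
        long[] result = new long[n + 1];          // result[i] = S(i); index 0 unused
        List<Thread> workers = new ArrayList<>();
        for (int j = 0; j < lo.length; j++) {
            final int from = lo[j], to = hi[j];
            Thread t = new Thread(() -> {
                for (int i = from; i <= to; i++) result[i] = s(i);   // disjoint indices, no locking
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) t.join();        // join makes the workers' writes visible here
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] lo = {1, 251, 501, 751};
        int[] hi = {250, 500, 750, 1000};
        long[] values = doall(1000, lo, hi);
        System.out.println("S(1000) = " + values[1000]);
    }
}
```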

which is computed on a parallel machine with $p$ processors $P_1, P_2, \ldots, P_p$. The value S(i) is found sequentially by calling the function S from Figure 2, and this is done with a workload
of Wi = |^— , i = 2,3, An important requirement of this 
computation is to have consecutive iterations computed by the same processor, as is often needed when checking conjectures that involve consecutive terms of S.

Computing the above doall loop is a classical scheduling problem in parallel computation. Scheduling methods find a mapping of the iterations onto the processors, that is, a partition of the set of indices $\{1, 2, \ldots, n\}$ into $p$ sets $\{S_j,\ j = 1, 2, \ldots, p\}$. Scheduling methods are classified into two main categories depending on when the partition is found: Static Scheduling Methods generate it at compile time, while Dynamic Scheduling Methods find it at run time. The main advantage of the latter is that they can detect when a processor becomes idle and assign iterations to it. Studies have shown that Dynamic Scheduling Methods achieve a good load balance of the workloads; however, they incur a small scheduling overhead.

On the other hand, Static Scheduling Methods have no scheduling overhead, but they often produce a poor balance of the workloads. The simplest way to schedule the iterations statically is to assign $n/p$ consecutive iterations to each processor. In this case processor $j$ receives the iterations $(j-1)\frac{n}{p} + 1, (j-1)\frac{n}{p} + 2, \ldots, j\frac{n}{p}$. This method, called Uniform Block Scheduling, gives a good load balance when the workloads $\{w_1, w_2, \ldots, w_n\}$ are similar. When the workloads increase or decrease, the method is clearly inefficient, because one processor gets all of the $n/p$ biggest workloads. Cyclic Scheduling corrects this by distributing the iterations in a cyclic fashion, so that consecutive big workloads are not assigned to the same processor. The method allocates to processor $j$ the iterations $\{j, j + p, j + 2p, \ldots, j + (\frac{n}{p} - 1)p\}$. Cyclic scheduling therefore offers an efficient load balance when the workloads increase or decrease.
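The two static assignments can be compared on a small example. The sketch below (hypothetical names) builds the Uniform Block and Cyclic iteration sets for processor j and sums a hypothetical increasing workload $w_i = i$, chosen only to mimic the increasing-workload case discussed above.

```java
import java.util.ArrayList;
import java.util.List;

// Uniform Block and Cyclic assignments of iterations 1..n to processors 1..p.
class StaticSchedules {
    // Uniform Block Scheduling: processor j gets (j-1)*n/p + 1 .. j*n/p
    // (integer division spreads the remainder when p does not divide n).
    static List<Integer> uniformBlock(int n, int p, int j) {
        List<Integer> iters = new ArrayList<>();
        long from = (long) (j - 1) * n / p + 1, to = (long) j * n / p;
        for (long i = from; i <= to; i++) iters.add((int) i);
        return iters;
    }

    // Cyclic Scheduling: processor j gets j, j + p, j + 2p, ... while <= n.
    static List<Integer> cyclic(int n, int p, int j) {
        List<Integer> iters = new ArrayList<>();
        for (int i = j; i <= n; i += p) iters.add(i);
        return iters;
    }

    // Total workload of a set of iterations for the hypothetical workload w_i = i.
    static long load(List<Integer> iters) {
        long sum = 0;
        for (int i : iters) sum += i;
        return sum;
    }

    public static void main(String[] args) {
        int n = 16, p = 4;
        for (int j = 1; j <= p; j++) {
            System.out.println("P" + j + ": uniform load " + load(uniformBlock(n, p, j))
                    + ", cyclic load " + load(cyclic(n, p, j)));
        }
    }
}
```

With n = 16 and p = 4, the uniform blocks carry loads 10, 26, 42 and 58, while the cyclic sets carry 28, 32, 36 and 40; only the block form, however, keeps consecutive iterations on one processor.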

Tabirca [13] recently proposed a static scheduling method named Balanced Workload Block Scheduling (BWBS 1). This is a block scheduling in which processor $j$ receives the consecutive iterations $\{l_j, l_j + 1, \ldots, h_j\}$, chosen so that its workload is balanced. Hence, the scheduling is defined by the lower and upper bounds $\{(l_j, h_j),\ j = 1, 2, \ldots, p\}$ such that

$$ l_1 = 1, \quad h_p = n, \quad l_j = h_{j-1} + 1, \quad j = 2, \ldots, p. $$




Suppose that there is an estimation or a formula for the workloads $\{w_1, w_2, \ldots, w_n\}$. Then the workload of the entire loop is $w = \sum_{i=1}^{n} w_i$, and the average workload per processor is

$$ W = \frac{1}{p} \sum_{i=1}^{n} w_i. $$

Clearly, a good scheduling should give bounds $(l_j, h_j)$ for processor $j$ such that

$$ \sum_{i=l_j}^{h_j} w_i \approx \frac{1}{p} \sum_{i=1}^{n} w_i = W, \quad \forall j = 1, 2, \ldots, p. $$



To evaluate the bounds, two functions are needed. First, by extending the inferior part (floor) function, define

$$ f^{[]}(x) = k \iff f(k) \le x < f(k+1). \qquad (6) $$

Tabirca et al. [13] show that if both $f^{[]}$ and the function $f(h) = \sum_{k=1}^{h} w_k$ exist or can be calculated, then the upper bounds are given by

$$ h_j = f^{[]}\big(W + f(h_{j-1})\big), \quad j = 1, 2, \ldots, p, \qquad (7) $$

with $h_0 = 0$, so that $f(h_0) = 0$.
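When $f$ has a closed form, Equation (7) can be evaluated directly. The sketch below assumes a hypothetical linear workload $w_i = i$, for which $f(h) = h(h+1)/2$ and $f^{[]}$ follows from the quadratic formula (with small integer corrections against rounding). The final bound is set to n to respect $h_p = n$. Names are illustrative only.

```java
// Equation (7) for a hypothetical workload w_i = i, where f(h) = h(h+1)/2 has a closed form.
class ClosedFormBwbs {
    // f(h) = w_1 + ... + w_h = h(h+1)/2
    static double f(long h) {
        return h * (h + 1) / 2.0;
    }

    // f^[](x) = k with f(k) <= x < f(k+1), Equation (6); assumes x >= 0
    static long fInv(double x) {
        long k = (long) Math.floor((Math.sqrt(8 * x + 1) - 1) / 2);   // invert the quadratic
        while (f(k + 1) <= x) k++;          // correct a rounding error downwards
        while (k > 0 && f(k) > x) k--;      // correct a rounding error upwards
        return k;
    }

    // Upper bounds h[0..p] with h[0] = 0; processor j runs h[j-1]+1 .. h[j].
    static long[] bounds(long n, int p) {
        double avg = f(n) / p;              // W, the average workload per processor
        long[] h = new long[p + 1];
        for (int j = 1; j <= p; j++) {
            h[j] = Math.min(n, fInv(avg + f(h[j - 1])));   // Equation (7)
        }
        h[p] = n;                           // enforce h_p = n
        return h;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(bounds(16, 4)));
    }
}
```

For n = 16 and p = 4 the bounds are 0, 7, 10, 12, 16, giving block loads 28, 27, 23 and 58: each of the first three blocks stays at or below W = 34, and the last processor absorbs whatever remains.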

However, the method can still be applied when there are no formulas for these two functions. In this case a pre-processing step is required to calculate the average workload $W$ and the upper bounds $\{h_j,\ j = 1, 2, \ldots, p\}$ using

$$ h_j = h \iff \sum_{i=l_j}^{h} w_i \le W < \sum_{i=l_j}^{h+1} w_i. \qquad (8) $$

The pre-processing step, however, introduces a scheduling overhead of $O(n)$.
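The pre-processing form of Equation (8) can be sketched as a single pass: compute W from all workloads, then grow each block from $l_j$ while its running sum stays at or below W. This is a hypothetical illustration, assuming the workloads are available (measured or estimated) as an array `w[1..n]`; the sample workloads in `main` are again the hypothetical $w_i = i$.

```java
// BWBS 1 with a pre-processing step, Equation (8): workloads given as w[1..n].
class Bwbs1Sketch {
    // Returns h[0..p] with h[0] = 0; processor j runs h[j-1]+1 .. h[j].
    static int[] bounds(long[] w, int p) {
        int n = w.length - 1;                   // w[0] is unused
        long total = 0;
        for (int i = 1; i <= n; i++) total += w[i];   // pre-processing pass over all n workloads
        double avg = (double) total / p;        // W
        int[] h = new int[p + 1];
        int i = 1;                              // i = l_j, the first iteration of block j
        for (int j = 1; j < p; j++) {
            long sum = 0;
            while (i <= n && sum + w[i] <= avg) {   // keep w_{l_j} + ... + w_h <= W
                sum += w[i];
                i++;
            }
            h[j] = i - 1;                       // Equation (8): adding w_{h+1} would exceed W
        }
        h[p] = n;                               // h_p = n
        return h;
    }

    public static void main(String[] args) {
        long[] w = new long[17];
        for (int i = 1; i <= 16; i++) w[i] = i;   // hypothetical increasing workloads
        System.out.println(java.util.Arrays.toString(bounds(w, 4)));
    }
}
```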

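Tabirca [12] proposed an improvement on this method, considering the partial sum from the start of the loop that is closest to $j \cdot W$ from below:

$$ h_j = h \iff \sum_{i=1}^{h} w_i \le j \cdot W < \sum_{i=1}^{h+1} w_i. \qquad (9) $$

In this case the upper bounds are given by

$$ h_j = f^{[]}(j \cdot W), \quad j = 1, 2, \ldots, p. \qquad (10) $$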
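Equations (9) and (10) compare prefix sums with $j \cdot W$, so the bounds can be read off one array of prefix sums. The sketch below is a hypothetical illustration of that idea with workloads given as `w[1..n]`; the last bound is set to n as required by $h_p = n$.

```java
// BWBS 2, Equations (9)-(10): h_j is the largest h with w_1 + ... + w_h <= j * W.
class Bwbs2Sketch {
    static int[] bounds(long[] w, int p) {
        int n = w.length - 1;                   // w[1..n]; w[0] is unused
        long[] prefix = new long[n + 1];        // prefix[h] = w_1 + ... + w_h = f(h)
        for (int i = 1; i <= n; i++) prefix[i] = prefix[i - 1] + w[i];
        double avg = (double) prefix[n] / p;    // W
        int[] h = new int[p + 1];
        int k = 0;
        for (int j = 1; j < p; j++) {
            while (k < n && prefix[k + 1] <= j * avg) k++;   // Equation (9)
            h[j] = k;                           // h_j = f^[](j * W), Equation (10)
        }
        h[p] = n;                               // h_p = n
        return h;
    }

    public static void main(String[] args) {
        long[] w = new long[17];
        for (int i = 1; i <= 16; i++) w[i] = i;   // hypothetical workloads w_i = i
        System.out.println(java.util.Arrays.toString(bounds(w, 4)));
    }
}
```

For the hypothetical workloads $w_i = i$ with n = 16 and p = 4, the bounds are 0, 7, 11, 13, 16 (block loads 28, 38, 25, 45), whereas the block-by-block rule of Equation (8) gives 0, 7, 10, 12, 16 (block loads 28, 27, 23, 58) on the same data.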

This Balanced Workload Block Scheduling (BWBS 2) method has been shown to be marginally better than the initial one. For the loop studied here, the workloads $w_1 = 0$ and $w_i$, $i = 2, 3, \ldots, n$, increase, so Uniform scheduling does not give an efficient solution. Dynamic or Cyclic Scheduling methods could be applied to obtain a better load balance; unfortunately, they are not suitable because the processors do not get consecutive iterations. Therefore, the Balanced Workload Block Scheduling methods remain the efficient solution for our problem. Since the sum of these workloads has no closed formula or simple approximation, the pre-processing step must be applied to obtain the scheduling bounds.





TABLE I
Execution Times on Processors.

            P1        P2        P3        P4
Uniform     225.25    611.45    971.45    1318.54
BWBS 1      787.81    780.78    777.61    785.18
BWBS 2      782.65    781.52    782.05    782.34

Fig. 3. Execution Times on Processors.

III. Experimental Results
In this section some experimental results are outlined to show how the problem is solved in parallel. The computation has been performed on a 100-node Beowulf cluster consisting of 50 Dell PowerEdge 1655MC servers, each with dual Pentium III processors (1.26 GHz, 512 KB cache) and 1 GB of RAM.

The Uniform and Balanced Workload Block Scheduling methods are used to schedule the loop iterations. To test these approaches, we generate S(i) for all i < 100,000,000 and check whether the equation S(i) = S(i + 1) has solutions. This is a long-standing conjecture that Ibstedt [3] checked for all numbers i < 1,000,000; it has been conjectured that the equation has no solutions.
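The S(i) = S(i + 1) test is where the requirement of consecutive values on one processor matters: each block can compare neighbouring values locally. The sketch below is a hypothetical illustration, not the code used for the experiments. Two assumptions are labelled here: each block evaluates one extra value S(h_j + 1) so that the pair straddling a block boundary is also checked, and, for brevity, S is computed by the slow definition-based method, which is only practical for small n.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Sketch: search for i with S(i) = S(i + 1) over blocks of consecutive indices.
class ConsecutiveCheck {
    // Stand-in for the S procedure: Equation (1) via k! mod n (slow, fine for small n).
    static long s(long n) {
        long k = 0, x = 1 % n;
        while (x != 0) { k++; x = (x * k) % n; }
        return k;
    }

    // Checks every pair (i, i + 1) with from <= i <= to. The block also evaluates
    // S(to + 1), so the pair straddling the boundary with the next block is covered.
    static List<Long> scanBlock(long from, long to) {
        List<Long> hits = new ArrayList<>();
        long prev = s(from);
        for (long i = from; i <= to; i++) {
            long next = s(i + 1);
            if (prev == next) hits.add(i);
            prev = next;
        }
        return hits;
    }

    // One block per processor, run in parallel; h[0] = 0 and h[p] = n are the block bounds.
    static List<Long> parallelScan(long[] h) {
        return IntStream.rangeClosed(1, h.length - 1).parallel()
                .mapToObj(j -> scanBlock(h[j - 1] + 1, h[j]))
                .flatMap(List::stream)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long[] h = {0, 500, 1000, 1500, 2000};
        System.out.println("solutions of S(i) = S(i+1), i <= 2000: " + parallelScan(h));
    }
}
```

In practice the bounds array would come from one of the BWBS methods, and S from the procedure of Figure 2.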





TABLE II
Variation of Execution Times.

            p = 1      p = 2      p = 4      p = 8     p = 16    p = 32
Uniform     2925.62    2088.53    1215.34    718.54    434.92    271.82
BWBS 1      2925.72    1543.87    787.81     412.18    237.61    148.50
BWBS 2      2925.68    1524.25    782.65     397.34    210.53    131.58




Fig. 4. Variation of Execution Times. 

The first test presents the workload distribution on processors. For this test the loop is scheduled on 4 processors and the computation time on each processor is measured, which gives an estimate of the workload balance achieved by each method. Table I and Figure 3 show these execution times. The Uniform Scheduling method generates a large imbalance: processor 4 is loaded almost 6 times as heavily as processor 1 (1318.54 versus 225.25). On the other hand, the BWBS methods give an efficient load balance, with a marginal advantage for the second one. The desired effect of BWBS balancing, namely execution times on the processors that are as nearly equal as possible, is clearly visible from the diagram.

The second test investigates how the overall execution time varies with the number of processors. The above loop has been run using p = 1, 2, 4, 8, 16, 32 processors, and the execution times are presented in Table II. Figure 4 shows, unsurprisingly, that the BWBS bounds offer not only the best balance but also the quickest computation.
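As a rough reading of Table II (these ratios are derived here from the table and are not reported by the authors), the speedup $T_1 / T_{32}$ and the corresponding efficiency, speedup divided by $p = 32$, are:

```latex
\begin{aligned}
&\text{Uniform:} && \tfrac{2925.62}{271.82} \approx 10.8, && \text{efficiency} \approx 0.34,\\
&\text{BWBS 1:}  && \tfrac{2925.72}{148.50} \approx 19.7, && \text{efficiency} \approx 0.62,\\
&\text{BWBS 2:}  && \tfrac{2925.68}{131.58} \approx 22.2, && \text{efficiency} \approx 0.69.
\end{aligned}
```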
IV. Final Conclusion

This article has presented how the values of the Smarandache function can be found in parallel. The computation required consecutive values to be calculated by the same processor, which restricted the scheduling methods that could be used. A variation of Balanced Workload Block Scheduling has been used to achieve an efficient computation. This has been possible only because the number of operations needed to compute S(i) is known.

Based on this method, several conjectures from Smarandache's Open Problem list [8] have been verified for all values up to 1,000,000,000 using BWBS scheduling. Unfortunately, no counterexample has been found to disprove any of them, so we can say only that they hold at least for all values under one billion. This type of computation can also be used to generate in parallel the values of other Number Theory functions, e.g. Erdős' or Euler's.